L03: Data structures

Lecture overview

We begin this lecture with a discussion of data structures. These are just collections of several pieces of data. Python provides a few different types of data structures, each with different properties (and hence, different use cases). We will talk about only the most commonly used ones in this course: lists, tuples, and dictionaries. Strings are also data structures (they are sequences of single characters) but we already covered them in the previous class.

In Python, different data structures (and basic data types like int or float) come with different “attributes” and “methods” (pieces of code that compute something using the data in that data structure). These attributes and methods can be accessed using the “dot” notation (the object, followed by a ., followed by the attribute or method name). We will cover these concepts in the second part of the lecture.

Lists

Lists are sequences of several pieces of data that are mutable (can be changed) and ordered.

Lists are created using square brackets, with their elements (members) separated by commas:

list1 = [1, 2]
print(list1)

[1, 2]

print(type(list1))

<class 'list'>

Lists can contain elements of different types:

list2 = [10, 10.5, 'abc', False, list1] #an int, a float, a str, a bool, and a list
print(list2)

[10, 10.5, 'abc', False, [1, 2]]

Elements (members) of a list can be accessed using the same indexing and slicing techniques we used for strings:

print(list2[2])

abc

print(list2[-2]) #second from the end

False

print(list2[1:3])

[10.5, 'abc']

Because Python starts counting from 0, the above notation usually causes some confusion: “How come 1:3 gives me the second and third elements?” One way to remember this is to think of 1 and 3 as endpoints “between” the elements of the list: 1 is the endpoint between the first and second elements, and 3 is the endpoint between the third and fourth elements. 1:3 asks Python for all the elements of the list between those endpoints (which means the second and third elements).
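One way to picture the “endpoints between elements” idea is to write the boundary positions under a list. The sketch below uses a made-up list called letters (it is not one of the lists defined earlier):

```python
# A hypothetical list used only for this illustration
letters = ['a', 'b', 'c', 'd']

# Boundary positions sit between the elements:
#
#   | 'a' | 'b' | 'c' | 'd' |
#   0     1     2     3     4

middle = letters[1:3]   # everything between boundary 1 and boundary 3
print(middle)           # ['b', 'c']

# The number of elements in a slice is the difference between the endpoints
print(len(middle))      # 2, i.e. 3 - 1
```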

Note that you can omit one of the endpoints in a slice:

print([10,20,30][:2]) #this is short for print([10,20,30][0:2])

[10, 20]

print([10,20,30][2:]) #this is short for print([10,20,30][2:None])

[30]

And remember that negative positions mean “starting from the end”:

print(['a','b','c'][1:-1])

['b']

print(['a','b','c'][-3:-1]) #note, when counting from the back, -1 is the last element (there is no -0)

['a', 'b']
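As a quick check of how negative positions line up with positive ones, the sketch below (using a hypothetical list named seq) compares the two ways of counting. For a list of length n, position -k refers to the same element as position n - k:

```python
seq = ['a', 'b', 'c']
n = len(seq)                   # len gives the number of elements (3 here)

last_by_negative = seq[-1]
last_by_positive = seq[n - 1]
print(last_by_negative, last_by_positive)   # c c

neg_slice = seq[-3:-1]
pos_slice = seq[n - 3:n - 1]   # same as seq[0:2]
print(neg_slice == pos_slice)  # True
```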

Remember, lists are ordered. This means that two lists with the same elements in a different order are not the same:

a = [1,2]
b = [2,1]
print(a==b)

False

Lists are mutable. That means you can change the value of their elements:

a = [1,2]
a[1] = 'a'
print(a)

[1, 'a']

Common operators for lists (+, *, in) work as they do for strings:

print([1,2] + ['a','b'])

[1, 2, 'a', 'b']

print([1,2] * 3)

[1, 2, 1, 2, 1, 2]

print(2 in [1,2])

True

print([2] in [1,2])

False

print([2] in [1, [2]])

True
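It may help to confirm that + and * build a new list and leave the original lists untouched. A minimal sketch, with hypothetical names x, y, combined, and repeated:

```python
x = [1, 2]
y = ['a', 'b']

combined = x + y    # a brand new list
repeated = x * 2    # also a new list

print(combined)     # [1, 2, 'a', 'b']
print(repeated)     # [1, 2, 1, 2]
print(x)            # [1, 2]  (unchanged)
print(y)            # ['a', 'b']  (unchanged)
```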

Tuples

Tuples are also sequences of several pieces of data. Like lists, tuples are ordered, but they are NOT mutable (i.e. they are immutable: their elements cannot be changed).

We create tuples using parentheses, with elements separated by commas:

t = (1, 'a', True, [5,6], (7,8))
print(t)

(1, 'a', True, [5, 6], (7, 8))

Everything we learned for lists also works for tuples, except for changing the value of any of their elements:

print(t[2:4])

(True, [5, 6])

print(t + ('c','d')) #note: this does NOT change t

(1, 'a', True, [5, 6], (7, 8), 'c', 'd')

print(t * 2)

(1, 'a', True, [5, 6], (7, 8), 1, 'a', True, [5, 6], (7, 8))

print('c' in t)

False

#This will not work
t[0] = 5

TypeError: 'tuple' object does not support item assignment

# But this does not give us an error
t = t + ('c','d') #this creates a new tuple and re-assigns the name t to it; the original tuple itself is not changed
print(t)

(1, 'a', True, [5, 6], (7, 8), 'c', 'd')
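The difference between changing a tuple and re-assigning the name t can be made visible with a small sketch. The names original and nested below are hypothetical and only used for illustration. The last part shows a related subtlety: a tuple cannot be given new elements, but a mutable element stored inside it (here, a list) can still be changed in place.

```python
t = (1, 'a')
original = t            # a second name for the same tuple

t = t + ('c', 'd')      # builds a NEW tuple and points the name t at it

print(original)         # (1, 'a')  (the old tuple is untouched)
print(t)                # (1, 'a', 'c', 'd')

nested = (1, [5, 6])
nested[1][0] = 50       # allowed: we change the list, not the tuple
print(nested)           # (1, [50, 6])
```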

Dictionaries (dict)

Python dictionaries are data structures that allow us to collect key-value pairs. This is similar to how a real dictionary is structured: words are “keys” and their definitions are “values”.

We construct dictionaries using curly brackets, with key-value pairs separated by commas, and each key separated from its value by a colon (:):

d = {'k1': 1, 'k2': 'abc', 5: [6,7]}
print(d)

{'k1': 1, 'k2': 'abc', 5: [6, 7]}

In the example above, the “keys” of the dictionary are ‘k1’, ‘k2’, and 5 (note that they do not have to be of the same type). The “values” of the dictionary are 1, ‘abc’, and [6,7] (values also do not have to be of the same type).

The main difference between dictionaries and lists is in the way we access their elements. For lists, we use the position of that element in the list. For dictionaries, we use the key of the value we want to retrieve:

print(d['k1'])

1

#This will not work
print(d[0])

KeyError: 0

#But why does this not give me an error?
print(d[5]) #note: this is NOT the fifth entry in the dictionary. it's the value corresponding to the key named 5

[6, 7]
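If you are not sure whether a key exists, the in operator (the same one we used for lists) checks the keys of a dictionary. The sketch below also uses get, a dictionary method (methods are covered later in this lecture), which returns None rather than raising a KeyError when the key is missing. This is one possible way to avoid the error, not the only one:

```python
d = {'k1': 1, 'k2': 'abc', 5: [6, 7]}

print('k1' in d)    # True: 'k1' is a key
print(0 in d)       # False: there is no key 0
print(1 in d)       # False: 1 is a value, not a key

missing = d.get(0)  # no KeyError; returns None
print(missing)      # None
```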

There are no restrictions on what the values in a dictionary can be, but keys must be unique. If the same key appears more than once, only the last entry is recorded:

mk = {'a': 10, 'b': 20, 'a':30}
print(mk)

{'a': 30, 'b': 20}

Adding an entry to a dictionary:

mk['c'] = 40
print(mk)

{'a': 30, 'b': 20, 'c': 40}

Changing an entry:

mk['a'] = 50
print(mk)

{'a': 50, 'b': 20, 'c': 40}

Removing an entry:

del mk['b']
print(mk)

{'a': 50, 'c': 40}

Note that if you try to run the cell above TWICE, the second run will give you an error (a KeyError): it tries to delete the entry for key ‘b’ but cannot find it, since you already deleted it the first time you ran the code.
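If you want a cell that can safely be run more than once, one option (an alternative to del, not a replacement for it) is the pop method of dictionaries, which accepts a default value to return when the key is absent. A minimal sketch, with hypothetical names removed and removed_again:

```python
mk = {'a': 50, 'b': 20, 'c': 40}

removed = mk.pop('b', None)        # removes 'b' and returns its value
print(removed)                     # 20
print(mk)                          # {'a': 50, 'c': 40}

removed_again = mk.pop('b', None)  # 'b' is already gone: returns None, no error
print(removed_again)               # None
```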

Attributes, and the dot notation

Objects of different types (either basic data types like int and float or data structures like lists or dicts) have a number of predefined attributes, which allow you to compute something about that object. Attributes that you call with parentheses are usually referred to as methods; in this lecture, we use the two terms more or less interchangeably.

We can list all the attributes of a particular object using the dir function:

mylist = ['a','b','c']
print(dir(mylist))

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

Note that we would get the same answer for any other list (i.e. the available attributes are the same for all objects of a given type):

anotherlist = [1,2]
print(dir(anotherlist))

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

And the result is different for different kinds of data types:

#Attributes of strings
print(dir("abc"))

['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

#Attributes of dictionaries
print(dir({'a':1}))

['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

Each of these attributes does something different. For example, in the list of dictionary attributes above, ‘keys’ gives us all the keys of that dictionary (as a dict_keys object, not a regular list). To use the ‘keys’ attribute, we write the name of that attribute after the name of the dictionary we want to apply it to, separated by a dot:

mydict = {'a': 1, 
          'b': 2}

k = mydict.keys() 
print(k)

dict_keys(['a', 'b'])
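Two closely related dictionary attributes are values and items. The sketch below applies them to the same mydict, and uses the built-in list function to turn each result into an ordinary list (the names key_list, value_list, and pair_list are just for illustration):

```python
mydict = {'a': 1,
          'b': 2}

key_list = list(mydict.keys())
value_list = list(mydict.values())
pair_list = list(mydict.items())    # each pair is a (key, value) tuple

print(key_list)     # ['a', 'b']
print(value_list)   # [1, 2]
print(pair_list)    # [('a', 1), ('b', 2)]
```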

There are different rules that you need to follow for each attribute, in terms of what you need to write AFTER the name of the attribute (e.g. the parentheses after keys in the example above are required to actually run it: without them, Python returns the method itself instead of the keys). These rules are referred to as the syntax of that attribute.
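To see what the parentheses do, the sketch below compares keys with and without them. Without parentheses, Python hands back the method itself and does not run it; the built-in callable function confirms that it is something that can be called. The names without_parens and with_parens are hypothetical:

```python
mydict = {'a': 1, 'b': 2}

without_parens = mydict.keys      # the method itself, not yet called
with_parens = mydict.keys()       # the result of calling the method

print(callable(without_parens))   # True
print(with_parens)                # dict_keys(['a', 'b'])
print(list(without_parens()))     # ['a', 'b']
```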

There is a very large number of attributes in Python, and you are not expected to remember the rules associated with each of them. There will be some that we use so often that you will naturally remember how to use them, but you will not be required to. This is the reason why programmers always have a Google tab open: you will constantly have to look up how a particular attribute is supposed to be used.

We will introduce more attributes as we need them throughout the rest of the semester. This section was just meant to introduce you to the concept and make you familiar with the dot notation. This notation is also used to access subpackages of Python packages. We will talk more about this when we cover packages in more detail later on.

Python built-in functions

Python has a set of built-in functions that are not specific to any given data type (unlike the attributes we mentioned above). We have already used several such functions: “print”, “type”, “dir”. A full list of built-in functions, as well as the syntax for all of them, can be found here: https://docs.python.org/3/library/functions.html

Note that, to use these built-in functions, we do not use the dot notation mentioned above. Instead, we pass the data on which we want to operate as a “parameter” to the function (i.e. inside parentheses, after the function name).

For example, we write:

print("abc")

abc

Not:

"abc".print()

AttributeError: 'str' object has no attribute 'print'
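Side by side, the two calling styles look like this. The sketch uses len (a built-in function) and upper (a string method that appears in the dir listing for strings); the variable names are made up for illustration:

```python
word = "abc"

length = len(word)        # built-in function: the data goes inside the parentheses
shouted = word.upper()    # method: the data comes before the dot

print(length)             # 3
print(shouted)            # ABC
```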

We will talk more about specific built-in functions as we need them later on in the course. If you want to take a look ahead of time, some of the most commonly used ones are: range(), abs(), list(), dict(), len(), round(), sum(), str(), zip().
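As a preview, here is a brief sketch of a few of these built-in functions in action. The input values are made up for illustration; see the documentation linked above for the full syntax of each function:

```python
numbers = [3, -1, 4]

print(len(numbers))            # 3
print(sum(numbers))            # 6
print(abs(-1))                 # 1
print(round(2.567, 1))         # 2.6
print(str(10) + '5')           # 105
print(list(range(3)))          # [0, 1, 2]

pairs = list(zip(['a', 'b'], [1, 2]))
print(pairs)                   # [('a', 1), ('b', 2)]
print(dict(pairs))             # {'a': 1, 'b': 2}
```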